first draft diirm doc #211
base: master
@@ -0,0 +1,118 @@ | |||
--- | |||
title: "Gen3 - DIIRM Submission" |
I honestly don't like the name DIIRM (not sure how Bob feels about it either at this point). I think we should just call this Gen3 Data Ingestion.
# DIIRM Submission of Data Files | ||
* * * | ||
|
||
The following guide details the steps a data contributor must take to submit project data to a Gen3 data commons with the Data, Ingest, Index, Resource Management (DIIRM) system. |
Data indexing, ingestion and release management
|
||
* * * | ||
|
||
## 1. Prepare Project with the Gen3 sdk tools |
Suggested change:
- ## 1. Prepare Project with the Gen3 sdk tools
+ ## 1. Prepare Project with the Gen3 SDK tools
|
||
* * * | ||
In order to submit data files, a Gen3 project must be present to associate the files to. The [Gen3 Submission sdk](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) has a comprehensive set of tools to enable users to script submission of projects. |
This is not true in DIIRM (and something important to point out). You don't need the graph to use most of the SDK code for data ingestion
This is important b/c some projects may only use our Framework Services and not have a full Gen3 Data Commons with a graph
|
||
### Data and Access Considerations | ||
|
||
The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs place very few restrictionso nthe organization of data within could bucket(s). |
Suggested change:
- The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs place very few restrictionso nthe organization of data within could bucket(s).
+ The recommended (and simplest) way for Gen3 to provide controlled access to data is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs place very few restrictions on the organization of data within could bucket(s).
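To make the "short-lived, temporary access" property discussed above concrete, here is a minimal sketch of how a time-limited signed URL can work in principle. It uses a generic HMAC signature over the URL and an expiry timestamp. This is not the signing scheme used by Gen3, AWS, or GCS; `SECRET_KEY`, `sign_url`, `verify_url`, and the example host are all hypothetical names chosen for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret; a real service keeps this server-side only.
SECRET_KEY = b"demo-secret"


def _signature(base_url: str, expires: int) -> str:
    """HMAC-SHA256 over the URL and its expiry time (illustrative scheme)."""
    message = f"{base_url}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, expires_in: int, now=None) -> str:
    """Return base_url with an expiry timestamp and signature appended."""
    now = int(time.time()) if now is None else int(now)
    expires = now + expires_in
    query = urlencode({"expires": expires, "signature": _signature(base_url, expires)})
    return f"{base_url}?{query}"


def verify_url(signed_url: str, now=None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    now = int(time.time()) if now is None else int(now)
    parsed = urlparse(signed_url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires)
    return hmac.compare_digest(signature, expected) and now <= expires


if __name__ == "__main__":
    url = sign_url("https://data.example.org/examplefile.txt", expires_in=60)
    print(url)
    print("valid now:", verify_url(url))
```

Because the signature covers only the object URL and the expiry time, nothing in this sketch depends on how objects are laid out inside a bucket, which mirrors the point that signed URLs place few restrictions on data organization.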
|
||
Gen3 offers an [Indexing sdk toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/tools/indexing.html#module-gen3.tools.indexing.index_manifest) to build, validate and map all files into a Gen3 datacommons. | ||
|
||
This file should offer meta data as well as bucket mapping. |
Suggested change:
- This file should offer meta data as well as bucket mapping.
+ This file should offer metadata as well as bucket mapping.
|
||
|
||
| File_name | File_size | md5sum | bucket_urls | acl | |
guid | size | md5 | urls | acl | authz
| examplefile.txt | 123456 | sample_md5 | s3://example-bucket/examplefile.txt gs://example-bucket/examplefile.txt | [phs000001,c1] | | ||
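As a rough illustration of the manifest described above, the following sketch builds and sanity-checks rows using the column names from the review comment (`guid | size | md5 | urls | acl | authz`). It relies only on the Python standard library and does not call the Gen3 indexing tools. The tab-separated layout, the space-separated multi-value fields, the helper names, and the sample `authz` path are assumptions and should be checked against the Gen3 indexing documentation before use.

```python
import csv
import hashlib
import io
import re

# Column names taken from the review comment; the exact set and order
# expected by the Gen3 indexing tools is an assumption here.
COLUMNS = ["guid", "size", "md5", "urls", "acl", "authz"]
MD5_RE = re.compile(r"^[0-9a-f]{32}$")
URL_PREFIXES = ("s3://", "gs://")


def make_row(content: bytes, urls, acl, authz, guid=""):
    """Build one manifest row from file bytes and its bucket locations."""
    return {
        "guid": guid,  # left blank here; assumed to be assigned at indexing time
        "size": str(len(content)),
        "md5": hashlib.md5(content).hexdigest(),
        "urls": " ".join(urls),
        "acl": " ".join(acl),
        "authz": " ".join(authz),
    }


def validate_row(row):
    """Return a list of problems found in a row (an empty list means OK)."""
    errors = []
    if not MD5_RE.match(row["md5"]):
        errors.append("md5 must be 32 lowercase hex characters")
    if not row["size"].isdigit():
        errors.append("size must be a non-negative integer")
    urls = row["urls"].split()
    if not urls:
        errors.append("at least one bucket URL is required")
    for url in urls:
        if not url.startswith(URL_PREFIXES):
            errors.append(f"unsupported URL scheme: {url}")
    return errors


def write_manifest(rows):
    """Serialize rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    row = make_row(
        b"example file contents",
        urls=["s3://example-bucket/examplefile.txt", "gs://example-bucket/examplefile.txt"],
        acl=["phs000001", "c1"],
        authz=["/programs/example/projects/demo"],  # hypothetical resource path
    )
    print(validate_row(row))
    print(write_manifest([row]))
```

Validating rows locally before indexing is a design choice in this sketch: it catches malformed checksums or bucket URLs before any record is created.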
|
||
* * * | ||
To continue your data submission return to the main [Gen3 - Data Contribution](/resources/user/submit-data/#4-submit-additional-project-metadata) page. |
This is a little misleading b/c this isn't required by DIIRM. The data submission into the graph is treated as a completely separate process and not a required one. DIIRM handles the pure indexing and metadata ingestion isolated from the graph submission
@@ -9,121 +9,37 @@ menuname: userMenu | |||
# Submitting Data Files and Linking Metadata in a Gen3 Data Commons | |||
* * * | |||
|
|||
The following guide details the steps a data contributor must take to submit a project to a Gen3 data commons. Feel free to take a look at our webinars about data submission to our Gen3 data commons on our [YouTube channel](https://www.youtube.com/channel/UCMCwQy4EDd1BaskzZgIOsNQ/videos). | |||
The following guide details two methods a data contributor can take to submit a project and data to a Gen3 data commons. | |||
|
|||
Data in a Gen3 data commons are either stored in variables that are exposed to the API for query (what we refer to as 'metadata') or are stored in files that must be downloaded prior to knowing their content (or 'data files'). For more information on the difference between data files and metadata exposed to the API, see the documentation on the [data dictionary in a Gen3 data commons](/resources/user/dictionary). |
The use of the term metadata is perhaps confusing with regards to the graph vs metadata service. We may want to adopt more specific naming
|
||
Data files such as spreadsheets, sequencing data (BAM, FASTQ), assay results, images, PDFs, etc., are uploaded to object storage with the [gen3-client command-line tool](/resources/user/gen3-client). | ||
|
||
>__Note:__ if your data files are already located in cloud storage, such as an AWS or GCS bucket, please see [this page](/resources/user/submit-data/sower) on how to make these files available in a Gen3 data commons. |
OR if you need to support Google.
I think we need a larger disclaimer that this existing Upload Data Files method is very limited in scalability and doesn't work with Google. imo we should push people to use the other method entirely, b/c ideally (once there's better tooling and docs) we should remove the old method to clean up our stack